Abstract

We present a convex formulation of dictionary learning for sparse signal decomposition. 
Convexity is obtained by replacing the usual explicit upper bound on the dictionary size by a 
convex rank-reducing term similar to the trace norm. In particular, our formulation introduces an 
explicit trade-off between size and sparsity of the decomposition of rectangular matrices. Using 
a large set of synthetic examples, we compare the estimation abilities of the convex and
non-convex approaches, showing that while the convex formulation has a single local minimum, this
may lead in some cases to performance which is inferior to the local minima of the non-convex 
formulation. 
1 Introduction

Sparse decompositions have become prominent tools in signal processing [1], image processing [2],
machine learning, and statistics [3]. Many relaxations and approximations of the associated
minimum cardinality problems are now available, based on greedy approaches [4] or
convex relaxations through the $\ell_1$-norm [1, 3]. Active areas of research are the design of
efficient algorithms to solve the optimization problems associated with the convex
non-differentiable norms (see, e.g., [5]), the theoretical study of the sparsifying effect of
these norms [6, 7], and the learning of the dictionary directly from data (see, e.g., [8, 2]).

In this paper, we focus on the third problem: we assume that we are given a matrix
$Y \in \mathbb{R}^{N \times P}$ and we look for factorizations of the form $X = UV^\top$, where
$U \in \mathbb{R}^{N \times M}$ and $V \in \mathbb{R}^{P \times M}$, that are close to $Y$ and
such that the matrix $U$ is sparse. This corresponds to decomposing $N$ vectors in
$\mathbb{R}^P$ (the rows of $Y$) over a dictionary of size $M$. The columns of $V$ are the
dictionary elements (of dimension $P$), while the rows of $U$ are the decomposition
coefficients of each data point. Learning sparse dictionaries from data has shown great promise
in signal processing tasks, such as image or speech processing [2], and core machine learning
tasks such as clustering may be seen as special cases of this framework [9].

Various approaches have been designed for sparse dictionary learning. Most of them consider
a specific loss between entries of $X$ and $Y$, and directly optimize over $U$ and $V$, with
additional constraints on $U$ and $V$ [10, 11]: dictionary elements, i.e., columns of $V$, may or
may not be constrained to unit norms, while a penalization on the rows of $U$ is added to impose
sparsity. Various forms of jointly non-convex alternating optimization frameworks may then be
used [10, 11, 2]. The main goal of this paper is to study the possibility and efficiency of
convexifying these non-convex approaches. As with all convexifications, this leads to the absence
of non-global local minima, and should allow simpler analysis. However, does it really work in
the dictionary learning context? That is, does convexity lead to better decompositions?

While in the context of sparse decomposition with fixed dictionaries, convexification has led 
to both theoretical and practical improvements (6) [2 13, we report both positive and negative 
results in the context of dictionary learning. That is, convexification sometimes helps and
sometimes does not. In particular, in high-sparsity and low-dictionary-size situations, the
non-convex formulation outperforms the convex one, while in other situations, the convex
formulation does perform better (see Section 5 for more details).

The paper is organized as follows: we show in Section 2 that if the size of the dictionary is
not bounded, then dictionary learning may be naturally cast as a convex optimization problem;
moreover, in Section 3, we show that in many cases of interest, this problem may be solved
in closed form, shedding some light on what is exactly achieved and not achieved by these
formulations. Finally, in Section 4, we propose a mixed formulation that leads to both
low-rank and sparse solutions in a joint convex framework. In Section 5, we present simulations
on a large set of synthetic examples.

Notations. Given a rectangular matrix $X \in \mathbb{R}^{N \times P}$ and $n \in \{1, \dots, N\}$,
$p \in \{1, \dots, P\}$, we denote by $X(n,p)$ or $X_{np}$ its element indexed by the pair
$(n,p)$, by $X(:,p) \in \mathbb{R}^N$ its $p$-th column and by $X(n,:) \in \mathbb{R}^P$ its
$n$-th row. Moreover, given a vector $x \in \mathbb{R}^N$, we denote by $\|x\|_q$ its
$\ell_q$-norm, i.e., for $q \in [1, \infty)$, $\|x\|_q = (\sum_{n=1}^N |x_n|^q)^{1/q}$ and
$\|x\|_\infty = \max_{n \in \{1,\dots,N\}} |x_n|$. We also write a matrix
$U \in \mathbb{R}^{N \times M}$ as $U = [u_1, \dots, u_M]$, where each $u_m \in \mathbb{R}^N$.



2 Decomposition norms 

We consider a loss $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ which is convex with
respect to the second variable. We assume in this paper that all entries of $Y$ are observed and
the risk of the estimate $X$ is equal to
$\frac{1}{NP} \sum_{n=1}^N \sum_{p=1}^P \ell(Y_{np}, X_{np})$. Note that our framework extends in
a straightforward way to matrix completion settings by summing only over observed entries [12].

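We consider factorizations of the form $X = UV^\top$; in order to constrain $U$ and $V$, we
consider the following optimization problem:

$$ \min_{(U,V) \in \mathbb{R}^{N \times M} \times \mathbb{R}^{P \times M}} \ \frac{1}{NP} \sum_{n=1}^N \sum_{p=1}^P \ell\big(Y_{np}, (UV^\top)_{np}\big) + \frac{\lambda}{2} \sum_{m=1}^M \big( \|u_m\|_C^2 + \|v_m\|_R^2 \big), \qquad (1) $$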

where $\|\cdot\|_C$ and $\|\cdot\|_R$ are any norms on $\mathbb{R}^N$ and $\mathbb{R}^P$ (on the
column space and row space of the original matrix $X$). This corresponds to penalizing each
column of $U$ and $V$. In this paper, instead of considering $U$ and $V$ separately, we consider
the matrix $X$ and the set of its decompositions of the form $X = UV^\top$, and in particular,
the one with minimum sum of norms $\|u_m\|_C^2$, $\|v_m\|_R^2$, $m \in \{1, \dots, M\}$. That is,
for $X \in \mathbb{R}^{N \times P}$, we consider

$$ f_D^M(X) = \min_{(U,V) \in \mathbb{R}^{N \times M} \times \mathbb{R}^{P \times M},\ X = UV^\top} \ \frac{1}{2} \sum_{m=1}^M \big( \|u_m\|_C^2 + \|v_m\|_R^2 \big). \qquad (2) $$

If $M$ is strictly smaller than the rank of $X$, then we let $f_D^M(X) = +\infty$. Note that the
minimum is always attained if $M$ is larger than or equal to the rank of $X$. Given $X$, each
pair $(u_m, v_m)$ is defined up to a scaling factor, i.e., $(u_m, v_m)$ may be replaced by
$(u_m s_m, v_m s_m^{-1})$; optimizing with respect to $s_m$ leads to the following equivalent
formulation:

$$ f_D^M(X) = \min_{(U,V) \in \mathbb{R}^{N \times M} \times \mathbb{R}^{P \times M},\ X = UV^\top} \ \sum_{m=1}^M \|u_m\|_C \|v_m\|_R. \qquad (3) $$

Moreover, we may derive another equivalent formulation by constraining the norms of the
columns of $V$ to one, i.e.,

$$ f_D^M(X) = \min_{(U,V) \in \mathbb{R}^{N \times M} \times \mathbb{R}^{P \times M},\ X = UV^\top,\ \forall m,\ \|v_m\|_R = 1} \ \sum_{m=1}^M \|u_m\|_C. \qquad (4) $$

This implies that constraining dictionary elements to be of unit norm, which is a common
assumption in this context [11, 2], is equivalent to penalizing the norms of the decomposition
coefficients instead of the squared norms.
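
The rescaling argument behind Eqs. (2) and (3) can be checked numerically. The following is a minimal sketch, assuming $\ell_2$-norms for both $\|\cdot\|_C$ and $\|\cdot\|_R$ purely for illustration; the helper names `penalty_after_scaling` and `optimal_scale` are hypothetical and not part of the paper. For a single pair $(u, v)$, minimizing $\frac{1}{2}(\|su\|^2 + \|v/s\|^2)$ over $s > 0$ gives $s^2 = \|v\| / \|u\|$ and the value $\|u\| \|v\|$.

```python
import numpy as np

def penalty_after_scaling(u, v, s):
    """Half the sum of squared l2-norms of the rescaled pair (u * s, v / s)."""
    return 0.5 * (np.linalg.norm(u * s) ** 2 + np.linalg.norm(v / s) ** 2)

def optimal_scale(u, v):
    """Scale s > 0 minimizing penalty_after_scaling: s^2 = ||v|| / ||u||."""
    return np.sqrt(np.linalg.norm(v) / np.linalg.norm(u))

# Illustrative data (assumption): any nonzero vectors would do.
rng = np.random.default_rng(0)
u, v = rng.standard_normal(5), rng.standard_normal(3)
s_star = optimal_scale(u, v)
product = np.linalg.norm(u) * np.linalg.norm(v)
# At s_star the squared-norm penalty of Eq. (2) equals the product form of Eq. (3).
```

The same computation goes through for any pair of norms, since only their positive homogeneity is used.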

Our optimization problem in Eq. (1) may now be equivalently written as

$$ \min_{X \in \mathbb{R}^{N \times P}} \ \frac{1}{NP} \sum_{n=1}^N \sum_{p=1}^P \ell(Y_{np}, X_{np}) + \lambda f_D^M(X), \qquad (5) $$

with any of the three formulations of $f_D^M(X)$ in Eqs. (2)-(4). The next proposition shows
that if the size $M$ of the dictionary is allowed to grow, then we obtain a norm on rectangular
matrices, which we refer to as a decomposition norm. In particular, this shows that if $M$ is
large enough the problem in Eq. (5) is a convex optimization problem.



Proposition 1 For all $X \in \mathbb{R}^{N \times P}$, the limit
$f_D^\infty(X) = \lim_{M \to \infty} f_D^M(X)$ exists and $f_D^\infty(\cdot)$ is a norm on
rectangular matrices.

Proof. Since, given $X$, $f_D^M(X)$ is nonnegative and clearly nonincreasing with $M$, it has a
nonnegative limit when $M$ tends to infinity. The only non-trivial part is the triangle
inequality, i.e., $f_D^\infty(X_1 + X_2) \le f_D^\infty(X_1) + f_D^\infty(X_2)$. Let
$\varepsilon > 0$ and let $(U_1, V_1)$ and $(U_2, V_2)$ be two $\varepsilon$-optimal
decompositions, i.e., such that
$f_D^\infty(X_1) \ge \sum_{m=1}^{M_1} \|u_{1m}\|_C \|v_{1m}\|_R - \varepsilon$ and
$f_D^\infty(X_2) \ge \sum_{m=1}^{M_2} \|u_{2m}\|_C \|v_{2m}\|_R - \varepsilon$. Without loss of
generality, we may assume that $M_1 = M_2 = M$. We consider $U = [U_1 \ U_2]$,
$V = [V_1 \ V_2]$; we have $X = X_1 + X_2 = UV^\top$ and
$f_D^\infty(X) \le \sum_{m=1}^M \big( \|u_{1m}\|_C \|v_{1m}\|_R + \|u_{2m}\|_C \|v_{2m}\|_R \big) \le f_D^\infty(X_1) + f_D^\infty(X_2) + 2\varepsilon$.
We obtain the triangle inequality by letting $\varepsilon$ tend to zero. ■

Following the last proposition, we now let $M$ tend to infinity; that is, if we denote
$\|X\|_D = f_D^\infty(X)$, we consider the following rank-unconstrained and convex problem:

$$ \min_{X \in \mathbb{R}^{N \times P}} \ \frac{1}{NP} \sum_{n=1}^N \sum_{p=1}^P \ell(Y_{np}, X_{np}) + \lambda \|X\|_D. \qquad (6) $$



However, there are three potentially major caveats that should be kept in mind: 

Convexity and polynomial time. Even though the norm $\|\cdot\|_D$ leads to a convex function,
computing or approximating it may take exponential time: in general, a problem being convex
does not imply that it can be solved in polynomial time. In some cases, however, the norm may be
computed in closed form, as presented in Section 3, while in other cases, an efficiently
computable convex lower bound is available (see Section 4).

Rank and dictionary size. The dictionary size $M$ must be allowed to grow to obtain
convexity and there is no reason, in general, to have a finite $M$ such that
$f_D^M(X) = f_D^\infty(X)$. In some cases presented in Section 3, the optimal $M$ is finite, but
we conjecture that in general the required $M$ may be unbounded. Moreover, in non-sparse
situations, the rank of $X$ and the dictionary size $M$ are usually equal, i.e., the matrices
$U$ and $V$ have full rank. However, in sparse decompositions, $M$ may be larger than the rank
of $X$, and sometimes even larger than the underlying data dimension $P$ (the corresponding
dictionaries are said to be overcomplete).

Local minima. The minimization problem in Eq. (1), with respect to $U$ and $V$, even with
$M$ very large, may still have multiple local minima, as opposed to the one in $X$, i.e., in
Eq. (6), which has a single local minimum. The main reason is that the optimization problem
defining $(U, V)$ from $X$, i.e., Eq. (3), may itself have multiple local minima. In particular,
it is to be contrasted with the optimization problem

$$ \min_{(U,V) \in \mathbb{R}^{N \times M} \times \mathbb{R}^{P \times M}} \ \frac{1}{NP} \sum_{n=1}^N \sum_{p=1}^P \ell\big(Y_{np}, (UV^\top)_{np}\big) + \lambda \|UV^\top\|_D, \qquad (7) $$

which will turn out to have no non-global local minima if $M$ is large enough (see Section 4.3
for more details).

Before looking at special cases, we compute the dual norm of $\|\cdot\|_D$ (see, e.g., [13] for
the definition and properties of dual norms), which will be used later.

Proposition 2 (Dual norm) The dual norm $\|Y\|_D^*$, defined as

$$ \|Y\|_D^* = \sup_{\|X\|_D \le 1} \mathrm{tr}\, X^\top Y, $$

is equal to $\|Y\|_D^* = \sup_{\|u\|_C \le 1,\ \|v\|_R \le 1} v^\top Y^\top u$.

Proof. We have, by convex duality (see, e.g., [13]),

$$ \|Y\|_D^* = \sup_{\|X\|_D \le 1} \mathrm{tr}\, X^\top Y = \inf_{\lambda \ge 0} \sup_X \ \mathrm{tr}\, X^\top Y - \lambda \|X\|_D + \lambda $$
$$ = \inf_{\lambda \ge 0} \lim_{M \to \infty} \sum_{m=1}^M \Big( \sup_{u_m, v_m} v_m^\top Y^\top u_m - \lambda \|u_m\|_C \|v_m\|_R \Big) + \lambda. $$

Let $a = \sup_{\|u\|_C \le 1,\ \|v\|_R \le 1} v^\top Y^\top u$. If $\lambda < a$,

$$ \sup_{u_m, v_m} v_m^\top Y^\top u_m - \lambda \|u_m\|_C \|v_m\|_R = +\infty, $$

while if $\lambda \ge a$, then

$$ \sup_{u_m, v_m} v_m^\top Y^\top u_m - \lambda \|u_m\|_C \|v_m\|_R = 0. $$

The result follows. ■
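
As a sanity check of Proposition 2, the sketch below takes the special case where both $\|\cdot\|_C$ and $\|\cdot\|_R$ are $\ell_2$-norms (an assumption made for illustration), in which the supremum of $v^\top Y^\top u$ over unit vectors is the largest singular value of $Y$. The function names are hypothetical.

```python
import numpy as np

def dual_norm_l2_l2(Y):
    """sup of v^T Y^T u over unit-l2 u and v: the largest singular value of Y."""
    return np.linalg.svd(Y, compute_uv=False)[0]

def bilinear_value(Y, u, v):
    """Value of v^T Y^T u for given u (length N) and v (length P)."""
    return float(v @ Y.T @ u)

# Illustrative matrix (assumption): sizes chosen arbitrarily.
rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 4))
U_svd, sigma, Vt = np.linalg.svd(Y)
# The top singular vectors attain the supremum.
best = bilinear_value(Y, U_svd[:, 0], Vt[0])
```

Any other pair of unit vectors gives a value no larger than `dual_norm_l2_l2(Y)`, which is what the supremum in the proposition expresses.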

3 Closed-form decomposition norms



We now consider important special cases, where the decomposition norms can be expressed
in closed form. For these norms, with the square loss, the convex optimization problems may
also be solved in closed form. Essentially, in this section, we show that in simple situations
involving sparsity (in particular when one of the two norms $\|\cdot\|_C$ or $\|\cdot\|_R$ is
the $\ell_1$-norm), letting the dictionary size $M$ go to infinity often leads to trivial
dictionary solutions, namely a copy of some of the rows of $Y$. This shows the importance of
constraining not only the $\ell_1$-norms, but also the $\ell_2$-norms, of the sparse vectors
$u_m$, $m \in \{1, \dots, M\}$, and leads to the joint low-rank/high-sparsity solution presented
in Section 4.



3.1 Trace norm: $\|\cdot\|_C = \|\cdot\|_2$ and $\|\cdot\|_R = \|\cdot\|_2$

When we constrain both the $\ell_2$-norms of $u_m$ and of $v_m$, it is well known that
$\|\cdot\|_D$ is the sum of the singular values of $X$, also known as the trace norm [12]. In
this case we only need $M \le \min\{N, P\}$ dictionary elements, but this number will turn out
in general to be a lot smaller; see in particular [14] for rank consistency results related to
the trace norm. Moreover, with the square loss, the solution of the optimization problem in
Eq. (6) is $X = \sum_{m=1}^{\min\{N,P\}} \max\{\sigma_m - \lambda NP, 0\}\, u_m v_m^\top$, where
$Y = \sum_{m=1}^{\min\{N,P\}} \sigma_m u_m v_m^\top$ is the singular value decomposition of $Y$.
Thresholding of singular values, as well as its interpretation as trace norm minimization, is
well known and well studied. However, sparse decompositions (as opposed to simply low-rank
decompositions) have been shown to lead to better decompositions in many domains such as image
processing (see, e.g., (H). 
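
The singular value thresholding formula of this section can be sketched as follows. This is a minimal illustration, assuming the square loss is $\ell(y, x) = \frac{1}{2}(y - x)^2$ so that the threshold is $\lambda NP$ as in the text; the function name `singular_value_threshold` is hypothetical.

```python
import numpy as np

def singular_value_threshold(Y, lam):
    """Soft-threshold the singular values of Y at lam * N * P."""
    N, P = Y.shape
    U, sigma, Vt = np.linalg.svd(Y, full_matrices=False)
    sigma_thr = np.maximum(sigma - lam * N * P, 0.0)
    # Rebuild X = sum_m max(sigma_m - lam*N*P, 0) u_m v_m^T.
    return (U * sigma_thr) @ Vt
```

With `lam = 0` the input is returned unchanged, and increasing `lam` removes the smallest singular values first, which is the rank-reducing effect of the trace norm.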



3.2 Sum of norms of rows: $\|\cdot\|_C = \|\cdot\|_1$

When we use the $\ell_1$-norm for $\|u_m\|_C$, whatever the norm on $v_m$, we have:

$$ \|Y\|_D^* = \sup_{\|u\|_1 \le 1,\ \|v\|_R \le 1} v^\top Y^\top u = \sup_{\|v\|_R \le 1} \ \sup_{\|u\|_1 \le 1} v^\top Y^\top u = \sup_{\|v\|_R \le 1} \|Yv\|_\infty $$
$$ = \max_{n \in \{1,\dots,N\}} \ \max_{\|v\|_R \le 1} |Y(n,:)\, v| = \max_{n \in \{1,\dots,N\}} \|Y(n,:)^\top\|_R^*, $$

which implies immediately that

$$ \|X\|_D = \sup_{\|Y\|_D^* \le 1} \mathrm{tr}\, X^\top Y = \sum_{n=1}^N \ \sup_{\|Y(n,:)^\top\|_R^* \le 1} \mathrm{tr}\, X(n,:)\, Y(n,:)^\top = \sum_{n=1}^N \|X(n,:)^\top\|_R. $$

That is, the decomposition norm is simply the sum of the norms of the rows. Moreover, an
optimal decomposition is $X = \sum_{n=1}^N \delta_n X(n,:)$, where $\delta_n \in \mathbb{R}^N$
is a vector with all null components except at $n$, where it is equal to one. In this case,
each row of $X$ is a dictionary element and the decomposition is indeed extremely sparse (only
one non-zero coefficient).

In particular, when $\|\cdot\|_R = \|\cdot\|_2$, we obtain the sum of the $\ell_2$-norms of the
rows, which leads to a closed form solution to Eq. (6) as
$X(n,:) = \max\{\|Y(n,:)^\top\|_2 - \lambda NP, 0\}\, Y(n,:) / \|Y(n,:)^\top\|_2$ for all
$n \in \{1, \dots, N\}$. Also, when $\|\cdot\|_R = \|\cdot\|_1$, we obtain the sum of the
$\ell_1$-norms of the rows, i.e., the $\ell_1$-norm of all entries of the matrix, which leads to
decoupled equations for each entry and closed form solution
$X(n,p) = \max\{|Y(n,p)| - \lambda NP, 0\}\, Y(n,p) / |Y(n,p)|$.
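
Both closed-form solutions above are soft-thresholding operators, and can be sketched in a few lines. The code assumes, as in the formulas, a threshold of $\lambda NP$; the function names are hypothetical, and the small constant guarding the division is an implementation detail added here so that all-zero rows are handled.

```python
import numpy as np

def row_soft_threshold(Y, lam):
    """Shrink the l2-norm of each row of Y by lam * N * P (rows below it become zero)."""
    N, P = Y.shape
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    scale = np.maximum(norms - lam * N * P, 0.0) / np.maximum(norms, 1e-12)
    return scale * Y

def entry_soft_threshold(Y, lam):
    """Shrink the magnitude of each entry of Y by lam * N * P."""
    N, P = Y.shape
    return np.sign(Y) * np.maximum(np.abs(Y) - lam * N * P, 0.0)
```

For example, with `Y = [[3, 4], [0.3, 0.4]]` and `lam = 0.25` the threshold is 1: the first row (norm 5) is scaled by 4/5 and the second row (norm 0.5) is set to zero, while the entrywise version returns `[[2, 3], [0, 0]]`.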

These examples show that with the $\ell_1$-norm on the decomposition coefficients, these simple
decomposition norms do not lead to solutions with small dictionary sizes. This suggests
considering a larger set of norms which leads to low-rank/small-dictionary and sparse solutions.
However, those two extreme cases are still useful, as they lead to good search ranges for the
regularization parameter $\lambda$ for the mixed norms presented in the next section.



4 Sparse decomposition norms

We now assume that we have $\|\cdot\|_R = \|\cdot\|_2$, i.e., we use the $\ell_2$-norm on the
dictionary elements. In this situation, when $\|\cdot\|_C = \|\cdot\|_1$, as shown in
Section 3.2, the solution corresponds to a very sparse but large (i.e., large dictionary size
$M$) matrix $U$; on the contrary, when $\|\cdot\|_C = \|\cdot\|_2$, as shown in Section 3.1, we
get a small but non-sparse matrix $U$. It is thus natural to combine the two norms on the
decomposition coefficients. The main result of this section is that the way we combine them is
mostly irrelevant and we can choose the combination which is the easiest to optimize.

Proposition 3 If the loss $\ell$ is differentiable, then for any function
$f : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$, such that
$\|\cdot\|_C = f(\|\cdot\|_1, \|\cdot\|_2)$ is a norm, and which is increasing with respect to
both variables, the solution of Eq. (6) for $\|\cdot\|_C = f(\|\cdot\|_1, \|\cdot\|_2)$ is the
solution of Eq. (6) for $\|\cdot\|_C = [(1 - \nu)\|\cdot\|_1^2 + \nu\|\cdot\|_2^2]^{1/2}$, for a
certain $\nu$ and a potentially different regularization parameter $\lambda$.

Proof. If we denote $L(X) = \frac{1}{NP} \sum_{n=1}^N \sum_{p=1}^P \ell(Y_{np}, X_{np})$ and
$L^*$ its Fenchel conjugate [15], then the dual problem of Eq. (6) is the problem of maximizing
$-L^*(Y)$ such that $\|Y\|_D^* \le \lambda$. Since the loss $L$ is differentiable, the primal
solution $X$ is entirely characterized by the dual solution $Y$. The optimality condition for
the dual problem is exactly that the gradient of $L^*$ is equal to $uv^\top$, where $(u, v)$ is
one of the maximizers in the definition of the dual norm, i.e., in
$\sup_{f(\|u\|_1, \|u\|_2) \le 1,\ \|v\|_2 \le 1} v^\top Y^\top u$. In this case, we have $v$ in
closed form, and $u$ is the maximizer of
$\sup_{f(\|u\|_1, \|u\|_2) \le 1} u^\top Y Y^\top u$. With our assumptions on $f$, these
maximizers are the same as the ones subject to $\|u\|_1 \le \alpha_1$ and
$\|u\|_2 \le \alpha_2$ for certain $\alpha_1, \alpha_2 \in \mathbb{R}_+$. The optimality
condition is thus independent of $f$. We then select the function
$f(a, b) = [(1 - \nu) a^2 + \nu b^2]^{1/2}$, which is practical as it leads to simple lower
bounds (see below). ■

We thus now consider the norm defined as $\|u\|_C^2 = (1 - \nu)\|u\|_1^2 + \nu\|u\|_2^2$. We
denote by $F$ the convex function defined on symmetric matrices as
$F(A) = (1 - \nu) \sum_{i,j=1}^N |A_{ij}| + \nu\, \mathrm{tr}\, A$, for which we have
$F(uu^\top) = (1 - \nu)\|u\|_1^2 + \nu\|u\|_2^2 = \|u\|_C^2$.

In the definition of $f_D^M(X)$ in Eq. (2), we can optimize with respect to $V$ in closed form,
i.e.,

$$ \min_{V \in \mathbb{R}^{P \times M},\ X = UV^\top} \ \frac{1}{2} \sum_{m=1}^M \|v_m\|_2^2 = \frac{1}{2} \mathrm{tr}\, X^\top (UU^\top)^{-1} X $$

is attained at $V = X^\top (UU^\top)^{-1} U$ (the value is infinite if the span of the columns
of $X$ is not included in the span of the columns of $U$). Thus the norm is equal to

$$ \|X\|_D = \lim_{M \to \infty} \ \min_{U \in \mathbb{R}^{N \times M}} \ \frac{1}{2} \sum_{m=1}^M F(u_m u_m^\top) + \frac{1}{2} \mathrm{tr}\, X^\top (UU^\top)^{-1} X. \qquad (8) $$
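
The two ingredients of Eq. (8) can be checked numerically: the identity $F(uu^\top) = (1-\nu)\|u\|_1^2 + \nu\|u\|_2^2$ and the closed-form minimizer $V = X^\top (UU^\top)^{-1} U$. The sketch below assumes $UU^\top$ is invertible (for instance $M \ge N$ with generic $U$); all function names are hypothetical.

```python
import numpy as np

def F(A, nu):
    """F(A) = (1 - nu) * sum_ij |A_ij| + nu * tr(A) on symmetric matrices."""
    return (1.0 - nu) * np.abs(A).sum() + nu * np.trace(A)

def squared_mixed_norm(u, nu):
    """(1 - nu) * ||u||_1^2 + nu * ||u||_2^2."""
    return (1.0 - nu) * np.linalg.norm(u, 1) ** 2 + nu * np.linalg.norm(u, 2) ** 2

def optimal_V(X, U):
    """Minimum Frobenius norm V with X = U V^T, assuming U U^T is invertible."""
    return X.T @ np.linalg.solve(U @ U.T, U)
```

For a generic $U$ with $M \ge N$, `U @ optimal_V(X, U).T` reproduces `X`, and half the squared Frobenius norm of the returned matrix equals $\frac{1}{2} \mathrm{tr}\, X^\top (UU^\top)^{-1} X$.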

Though $\|X\|_D$ is a convex function of $X$, we currently don't have a polynomial time
algorithm to compute it, but, since $F$ is convex and homogeneous,
$\sum_{m \ge 0} F(u_m u_m^\top) \ge F\big(\sum_{m \ge 0} u_m u_m^\top\big)$. This leads to the
following lower-bounding convex optimization problem in the positive semidefinite matrix
$A = UU^\top$:

$$ \|X\|_D \ge \min_{A \succeq 0} \ \frac{1}{2} F(A) + \frac{1}{2} \mathrm{tr}\, X^\top A^{-1} X. \qquad (9) $$

This problem can now be solved in polynomial time [13]. This computable lower bound in
Eq. (9) may serve two purposes: (a) it provides a good initialization to gradient descent or
path following rounding techniques presented in Section 4.1; (b) the convex lower bound provides
sufficient conditions for approximate global optimality of the non-convex problems [13].
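
The inequality behind the lower bound, $F(\sum_m u_m u_m^\top) \le \sum_m F(u_m u_m^\top)$, is easy to verify numerically. The following sketch uses hypothetical helper names and an arbitrary value of $\nu$; it only illustrates the inequality and does not solve the convex problem in Eq. (9).

```python
import numpy as np

def F(A, nu):
    """F(A) = (1 - nu) * sum_ij |A_ij| + nu * tr(A)."""
    return (1.0 - nu) * np.abs(A).sum() + nu * np.trace(A)

def sum_of_rank_one_F(U, nu):
    """Sum over the columns u_m of U of F(u_m u_m^T), as in Eq. (8)."""
    return sum(F(np.outer(u, u), nu) for u in U.T)

def F_of_gram(U, nu):
    """F evaluated at A = U U^T, the quantity appearing in the lower bound."""
    return F(U @ U.T, nu)
```

The gap between the two quantities comes only from sign cancellations in the off-diagonal entries of $UU^\top$: when all entries of $U$ have the same sign, the two values coincide.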



4.1 Recovering the dictionary and/or the decomposition 

Given a solution or approximate solution $X$ to our problem, one may want to recover the
dictionary elements $V$ and/or the decomposition $U$ for further analysis. Note that (a) having
one of them automatically gives the other one and (b) in some situations, e.g., denoising of $Y$
through estimating $X$, the matrices $U$ and $V$ are not explicitly needed.

We propose to iteratively minimize with respect to $U$ (by gradient descent) the following
function, which is a convex combination of the true function in Eq. (8) and its lower bound in
Eq. (9):

$$ \frac{1 - \eta}{2} F(UU^\top) + \frac{\eta}{2} \sum_{m=1}^M F(u_m u_m^\top) + \frac{1}{2} \mathrm{tr}\, X^\top (UU^\top)^{-1} X. $$

When $\eta = 0$, this is exactly our convex lower bound defined in Eq. (9), for which
there are no local minima in $U$, although it is not a convex function of $U$ (see Section 4.3
for more details), while at $\eta = 1$, we get a non-convex function of $U$, with potentially
multiple local minima. This path following strategy has been shown to lead to good local minima
in other settings [13].

Moreover, this procedure may be seen as the classical rounding operation that follows a
convex relaxation; the only difference here is that we relax a hard convex problem into a simple
convex problem. Finally, the same technique can be applied when minimizing the regularized
estimation problem in Eq. (6), and, as shown in Section 5, rounding leads to better performance.



4.2 Optimization with square loss 

In our simulations, we will focus on the square loss as it leads to simpler optimization, but our
decomposition norm framework could be applied to other losses. With the square loss, we can
optimize directly with respect to $V$ (in the same way that we could earlier for computing the
norm itself); we temporarily assume that $U \in \mathbb{R}^{N \times M}$ is known; we have:

$$ \min_{V \in \mathbb{R}^{P \times M}} \ \frac{1}{2NP} \|Y - UV^\top\|_F^2 + \frac{\lambda}{2} \|V\|_F^2 $$
$$ = \frac{1}{2NP} \mathrm{tr}\, Y^\top \big[ I - U (U^\top U + \lambda NP\, I)^{-1} U^\top \big] Y $$
$$ = \frac{1}{2NP} \mathrm{tr}\, Y^\top (UU^\top / \lambda NP + I)^{-1} Y, $$

with a minimum attained at
$V = Y^\top U (U^\top U + \lambda NP\, I)^{-1} = Y^\top (UU^\top + \lambda NP\, I)^{-1} U$. The
minimum is a convex function of $UU^\top \in \mathbb{R}^{N \times N}$ and we now have a convex
optimization problem over positive semidefinite matrices, which is equivalent to Eq. (6):

$$ \min_{A \succeq 0} \ \frac{1}{2NP} \mathrm{tr}\, Y^\top (A / \lambda NP + I)^{-1} Y + \frac{\lambda}{2} \min_{A = \sum_{m \ge 0} u_m u_m^\top} \ \sum_{m \ge 0} F(u_m u_m^\top). \qquad (10) $$

It can be lower bounded by the following still convex, but now solvable in polynomial time,
problem:

$$ \min_{A \succeq 0} \ \frac{1}{2NP} \mathrm{tr}\, Y^\top (A / \lambda NP + I)^{-1} Y + \frac{\lambda}{2} F(A). \qquad (11) $$

This fully convex approach will be solved within a globally optimal low-rank optimization
framework (presented in the next section). Then, rounding operations similar to Section 4.1
may be used to improve the solution; note that this rounding technique takes $Y$ into account
and is thus preferable to the direct application of Section 4.1.
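
The partial minimization over $V$ used in this section can be checked numerically. The sketch below, with hypothetical function names, computes the closed-form minimizer $V = Y^\top U (U^\top U + \lambda NP\, I)^{-1}$ and compares the resulting objective with the expression $\frac{1}{2NP} \mathrm{tr}\, Y^\top (UU^\top / \lambda NP + I)^{-1} Y$; it assumes $\lambda > 0$.

```python
import numpy as np

def optimal_V_square_loss(Y, U, lam):
    """Closed-form minimizer over V of the regularized square loss (a ridge regression)."""
    N, P = Y.shape
    M = U.shape[1]
    gram = U.T @ U + lam * N * P * np.eye(M)
    return np.linalg.solve(gram, U.T @ Y).T

def objective(Y, U, V, lam):
    """(1 / 2NP) * ||Y - U V^T||_F^2 + (lam / 2) * ||V||_F^2."""
    N, P = Y.shape
    return np.linalg.norm(Y - U @ V.T) ** 2 / (2 * N * P) + 0.5 * lam * np.linalg.norm(V) ** 2

def partial_minimum(Y, U, lam):
    """(1 / 2NP) * tr Y^T (U U^T / (lam N P) + I)^{-1} Y, the value after minimizing over V."""
    N, P = Y.shape
    A = U @ U.T
    return np.trace(Y.T @ np.linalg.solve(A / (lam * N * P) + np.eye(N), Y)) / (2 * N * P)
```

The agreement of `objective` at the closed-form `V` with `partial_minimum` is an instance of the matrix inversion lemma, and it is what makes the problem depend on $U$ only through $UU^\top$.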



4.3 Low rank optimization over positive definite matrices 

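We first smooth the problem by using
$(1 - \nu) \sum_{i,j=1}^N (A_{ij}^2 + \varepsilon^2)^{1/2} + \nu\, \mathrm{tr}\, A$ as an
approximation of $F(A)$, and
$(1 - \nu) \big( \sum_{i=1}^N (u_i^2 + \varepsilon^2)^{1/2} \big)^2 + \nu \|u\|_2^2$ as an
approximation of $F(uu^\top)$.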
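
The two smoothed surrogates can be written directly from their definitions. This is a minimal sketch with hypothetical function names; setting `eps = 0` recovers the non-smooth functions, and any `eps > 0` gives a differentiable upper bound.

```python
import numpy as np

def F_smooth(A, nu, eps):
    """Smoothed F(A): |A_ij| is replaced by sqrt(A_ij^2 + eps^2)."""
    return (1.0 - nu) * np.sqrt(A ** 2 + eps ** 2).sum() + nu * np.trace(A)

def F_rank_one_smooth(u, nu, eps):
    """Smoothed F(u u^T): the l1-norm of u is replaced by sum_i sqrt(u_i^2 + eps^2)."""
    return (1.0 - nu) * np.sqrt(u ** 2 + eps ** 2).sum() ** 2 + nu * float(u @ u)
```

A smaller `eps` gives a tighter approximation at the price of larger curvature near zero entries, which matters for the descent algorithms discussed next.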

Following [16], since we expect low-rank solutions, we can optimize over low-rank matrices.
Indeed, [16] shows that if $G$ is a convex function over positive semidefinite symmetric
matrices of size $N$, with a rank-deficient global minimizer (i.e., of rank $r < N$), then the
function $U \mapsto G(UU^\top)$ defined over matrices $U \in \mathbb{R}^{N \times M}$ has no
non-global local minima as soon as $M > r$. The following novel proposition goes a step further
for twice differentiable functions by showing that there is no need to know $r$ in advance:

Proposition 4 Let $G$ be a twice differentiable convex function over positive semidefinite
symmetric matrices of size $N$, with compact level sets. If the function
$H : U \mapsto G(UU^\top)$ defined over matrices $U \in \mathbb{R}^{N \times M}$ has a local
minimum at a rank-deficient matrix $U$, then $UU^\top$ is a global minimum of $G$.

Proof. Let $A = UU^\top$. The gradient of $H$ is equal to
$\nabla H(U) = 2 \nabla G(UU^\top) U$ and the Hessian of $H$ is such that
$\nabla^2 H(U)(V, V) = 2\, \mathrm{tr}\, \nabla G(UU^\top) VV^\top + \nabla^2 G(UU^\top)(UV^\top + VU^\top, UV^\top + VU^\top)$.
Since we have a local minimum, $\nabla H(U) = 0$, which implies that
$\mathrm{tr}\, \nabla G(A) A = \frac{1}{2} \mathrm{tr}\, \nabla H(U) U^\top = 0$. Moreover, by
invariance by post-multiplying $U$ by an orthogonal matrix, without loss of generality, we may
consider that the last column of $U$ is zero. We now consider all directions
$V \in \mathbb{R}^{N \times M}$ with first $M - 1$ columns equal to zero and last column being
equal to a given $v \in \mathbb{R}^N$. For such directions $UV^\top = 0$, so that the second
order Taylor expansion of $H(U + tV)$ is

$$ H(U + tV) = H(U) + t^2\, \mathrm{tr}\, \nabla G(A) VV^\top + \frac{t^2}{2} \nabla^2 G(A)(UV^\top + VU^\top, UV^\top + VU^\top) + O(t^3) $$
$$ = H(U) + t^2\, v^\top \nabla G(A) v + O(t^3). $$

Since we have a local minimum, we must have $v^\top \nabla G(A) v \ge 0$. Since $v$ is
arbitrary, this implies that $\nabla G(A) \succeq 0$. Together with the convexity of $G$ and
$\mathrm{tr}\, \nabla G(A) A = 0$, this implies that we have a global minimum of $G$ [13]. ■

The last proposition suggests trying a small $M$, and checking that a local minimum obtained
with descent algorithms is indeed rank-deficient. If it is, we have a solution; if not, we
simply increase $M$ and start again until $M$ turns out to be greater than $r$.
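
The procedure just described can be sketched as a loop around any local solver. In the code below, `local_minimize` is a hypothetical callable that takes $M$ and returns a local minimizer $U \in \mathbb{R}^{N \times M}$ of $U \mapsto G(UU^\top)$; the stub used to exercise the loop is a stand-in and not an actual descent algorithm, and increasing $M$ by one at a time is an arbitrary choice.

```python
import numpy as np

def grow_until_rank_deficient(local_minimize, M_start, M_max, tol=1e-6):
    """Increase M until the local minimizer returned for that M is rank-deficient."""
    M = M_start
    while M <= M_max:
        U = local_minimize(M)
        # Rank deficiency (numerical rank below M) certifies a global minimum of G.
        if np.linalg.matrix_rank(U, tol=tol) < M:
            return U, M
        M += 1
    raise RuntimeError("no rank-deficient local minimum found up to M_max")

def stub_minimizer(M, N=5, r=2):
    """Stand-in for a solver (assumption): returns an N x M matrix of rank min(M, r)."""
    rng = np.random.default_rng(M)
    U = np.zeros((N, M))
    k = min(M, r)
    U[:, :k] = rng.standard_normal((N, k))
    return U
```

With the stub, whose outputs have rank at most 2, the loop started at `M_start=1` stops at $M = 3$, the first value strictly greater than $r = 2$.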

Note that minimizing our convex lower bound in Eq. (11) by any descent algorithm in $(U, V)$
is different from solving directly Eq. (1): in the first situation, there are no (non-global)
local minima, whereas there may be some in the second situation. In practice, we use a
quasi-Newton algorithm which has complexity $O(N^2)$ to reach a stationary point, but requires
computing the Hessian of size $NM \times NM$ to check and potentially escape local minima.

4.4 Links with sparse principal component analysis 

If we now consider that we want sparse dictionary elements instead of sparse decompositions,
we exactly obtain the problem of sparse PCA [17, 18], where one wishes to decompose a data
matrix $Y$ into $X = UV^\top$ where the dictionary elements are sparse, and thus easier to
interpret. Note that in our situation, we have seen that with $\|\cdot\|_R = \|\cdot\|_2$, the
problem in Eq. (1) is equivalent to Eq. (10) and indeed only depends on the covariance matrix
$\frac{1}{P} YY^\top$.

This approach to sparse PCA is similar to the non-convex formulations of [18] and is to be
contrasted with the convex formulation of [17], as we aim at directly obtaining a full
decomposition of $Y$ with an implicit trade-off between dictionary size (here the number of
principal components) and sparsity of such components. Most works consider one unique component,
even though the underlying data have many more underlying dimensions, and deal with multiple
components by iteratively solving a reduced problem. In the non-sparse case, the two
approaches are equivalent, but they are not here. By varying $\lambda$ and $\nu$, we obtain a
set of solutions with varying ranks and sparsities. We are currently comparing the approach of
[18], which constrains the rank of the decomposition, to ours, where the rank is penalized
implicitly.

5 Simulations 

We have performed extensive simulations on synthetic examples to compare the various
formulations. Because of identifiability problems which are the subject of ongoing work, it is
not appropriate to compare decomposition coefficients and/or dictionary elements; we rather
consider a denoising experiment. Namely, we have generated matrices $Y_0 = UV^\top$ as follows:
select $M$ unit norm dictionary elements $v_1, \dots, v_M$ in $\mathbb{R}^P$ uniformly and
independently at random; for each $n \in \{1, \dots, N\}$, select $S$ indices in
$\{1, \dots, M\}$ uniformly at random and form the $n$-th row of
$U \in \mathbb{R}^{N \times M}$ with zeroes except for random normally distributed elements at
the $S$ selected indices. Construct
$Y = Y_0 + (\mathrm{tr}\, Y_0 Y_0^\top)^{1/2} \sigma \varepsilon / (NP)^{1/2}$, where
$\varepsilon$ has independent standard normally distributed elements and $\sigma$ is held fixed
at 0.6. The goal is to estimate $Y_0$ from $Y$, and we compare the three following formulations
on this task: (a) the convex minimization of Eq. (11) through techniques presented in
Section 4.3 with varying $\nu$ and $\lambda$, denoted as Conv, (b) the rounding of the previous
solution using techniques described in Section 4.1, denoted as Conv-R, and (c) the low-rank
constrained problem in Eq. (1) with $\|\cdot\|_C = \|\cdot\|_1$ and
$\|\cdot\|_R = \|\cdot\|_2$ with varying $\lambda$ and $M$, denoted as NoConv, and which is the
standard method in sparse dictionary learning [8, 2, 11].
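
The data generating process described above can be sketched as follows. Several details are assumptions made for this illustration and are not specified in the text: unit-norm dictionary elements are drawn by normalizing Gaussian vectors, the $S$ indices of each row are drawn without replacement, and the function name `generate_data` and its `seed` parameter are hypothetical.

```python
import numpy as np

def generate_data(N, P, M, S, sigma=0.6, seed=0):
    """Draw a sparse decomposition Y0 = U V^T and a noisy observation Y."""
    rng = np.random.default_rng(seed)
    # M dictionary elements (columns of V), uniform on the unit sphere of R^P.
    V = rng.standard_normal((P, M))
    V /= np.linalg.norm(V, axis=0, keepdims=True)
    # Each row of U has S non-zero, normally distributed coefficients.
    U = np.zeros((N, M))
    for n in range(N):
        support = rng.choice(M, size=S, replace=False)  # assumption: no repeated index
        U[n, support] = rng.standard_normal(S)
    Y0 = U @ V.T
    noise = rng.standard_normal((N, P))
    # Noise level scaled by (tr Y0 Y0^T)^{1/2} / (N P)^{1/2}, as in the text.
    Y = Y0 + np.sqrt(np.trace(Y0 @ Y0.T)) * sigma * noise / np.sqrt(N * P)
    return Y0, Y, U, V
```

A call such as `generate_data(N=20, P=10, M=15, S=3)` returns the clean matrix, its noisy version and the factors, so that any estimate $X$ can be scored by $\|X - Y_0\|^2$; the sizes in this call are illustrative and are not the ones used in the experiments.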

For the three methods and for each replication, we select the two regularization parameters
that lead to the minimum value $\|X - Y_0\|^2$, and compute the relative improvement over using
the singular value decomposition (SVD) of $Y$. If the value is negative, denoising is better
than with
-J.D± J.Z


-43.3 ±Z.U 


1 f. A _l_ 1 A 

-10. 4± 1.4 


T f\-L 1 "2 
- /.U± 1.3 


3 


1 r\ 
1U 


ZU 


o 

z 


-O.D± J.O 


O A -I- 1 C 

-V.U±l.o 


-o.4± 1 .y 


-13.U±Z. / 


11 C -LI c 

-1 1 .J ± 1 .J 


-1U.J ± l.j 


4 


ZU 


ZU 


z 


-Z4.V± J.J 


-13.U±U. / 


1 r\ a j- 1 1 
-11). 4± 1.1 


/IA Q-Wy T 

-4U.V±Z.Z 


1 C O _l_f\ Q 


1/1 q i n ^7 
-14.0 ztU. / 






4.0 


9 

z, 


U.U ZL 


e q_i_ i c 


o A-)-i 4 


I m\J _J_ Z_ . U 


-10 1+1 a 


-Q Q4- 1 f, 
y .y zn i .u 


Line   P   M   S     NoConv     Conv-R       Conv     NoConv     Conv-R       Conv
   6  20  40   2  -13.2±2.6  -12.3±1.4  -11.5±1.3  -25.4±3.0  -16.7±1.3  -15.6±1.4
   7  10  10   4    1.7±3.9   -1.5±0.5   -0.2±0.2   -1.9±2.5   -1.7±0.6   -0.1±0.1
   8  20  10   4  -16.7±5.9   -1.4±0.8   -0.0±0.0  -27.1±1.8   -3.0±0.7    0.0±0.0
   9  10  20   4    2.2±2.4   -2.5±0.9   -1.7±0.8    2.0±2.9   -2.5±0.8   -1.2±1.0
  10  20  20   4   -1.2±2.5   -3.1±1.1   -0.9±0.9  -12.1±3.0   -5.5±1.0   -1.6±1.0
  11  10  40   4    3.5±3.0   -3.3±1.3   -3.3±1.5    2.6±0.9   -3.3±0.5   -3.3±0.5
  12  20  40   4    3.7±2.3   -3.9±0.6   -3.6±0.8   -1.7±1.7   -6.3±0.9   -5.3±0.8
  13  10  10   8    9.6±3.4   -0.1±0.1    0.0±0.0    7.2±3.0   -0.1±0.1    0.0±0.0
  14  20  10   8   -1.6±3.7    0.0±0.0    0.0±0.0   -4.8±2.3    0.0±0.0    0.0±0.0
  15  10  20   8    9.6±2.4   -0.4±0.4   -0.2±0.3    9.4±1.5   -0.4±0.4   -0.2±0.2
  16  20  20   8   11.3±1.8   -0.2±0.2   -0.0±0.0    7.0±2.5   -0.4±0.3   -0.0±0.0
  17  10  40   8    8.8±3.0   -0.8±0.7   -0.7±0.7    7.2±1.3   -0.7±0.4   -0.5±0.5
  18  20  40   8   10.9±1.1   -0.9±0.6   -0.6±0.5    9.4±1.0   -1.0±0.4   -0.4±0.4

Table 1: Percentage of improvement in mean squared error, with respect to spectral denoising, for
various parameters of the data generating process. See text for details.
the SVD (the more negative, the better). In Table 1, we present averages over 10 replications
for various values of $N$, $P$, $M$, and $S$.

First, in these simulations where the decomposition coefficients are known to be sparse,
penalizing by $\ell_1$-norms indeed improves performance over spectral denoising for all
methods. Second, as expected, the rounded formulation (Conv-R) does perform better than the
non-rounded one (Conv), i.e., our rounding procedure allows us to find "good" local minima of
the non-convex problem in Eq. (1).

Moreover, in high-sparsity situations ($S = 2$, lines 1 to 6 of Table 1), we see that the
rank-constrained formulation NoConv outperforms the convex formulations, sometimes by a wide
margin (e.g., lines 1 and 2). This is not the case when the ratio $M/P$ becomes larger than 2
(lines 3 and 5). In the medium-sparsity situation ($S = 4$, lines 7 to 12), we observe the same
phenomenon, but the non-convex approach is better only when the ratio $M/P$ is smaller than or
equal to one. Finally, in low-sparsity situations ($S = 8$, lines 13 to 18), imposing sparsity
does not improve performance much and the local minima of the non-convex approach NoConv
really hurt performance. Thus, from Table 1, we can see that with high sparsity (small $S$) and
small relative dictionary size of the original noiseless data (i.e., low ratio $M/P$), the
non-convex approach performs better. We are currently investigating theoretical arguments to
support these empirical findings.

6 Conclusion 

In this paper, we have investigated the possibility of convexifying the sparse dictionary
learning problem. We have reached both positive and negative conclusions: indeed, it is possible
to convexify the problem by letting the dictionary size explicitly grow with proper
regularization to ensure low-rank solutions; however, it only leads to better predictive
performance for problems which are not too sparse and with large enough dictionaries. In the
high-sparsity/small-dictionary cases, the non-convex problem is empirically simple enough to
solve so that our convexification leads to no gain.

We are currently investigating more refined convexifications and extensions to nonnegative 
valiants 0, applications of our new decomposition norms to clustering [9], the possibility of 
obtaining consistency theorems similar to [14] for the convex formulation, and the application
to the image denoising problem [2].